ABSTRACT

Hashtags on Twitter are used to track events, topics and
activities. A correlated hashtag graph represents the
contextual relationships among these hashtags. Maximal
cliques in the correlated hashtag graph, which we call
clusters, can be contextually meaningful hashtag groups. To
track the changes of the clusters and understand these
hashtag groups, the hashtags in a cluster are categorized
into two types: a stable core, and temporary members which
are subject to change. This project reports some initial
studies: three algorithms for stable core detection are
designed, implemented and evaluated experimentally.
1. INTRODUCTION

Twitter is one of the most popular online social networking
and microblogging services in the world. People share
information on Twitter in the form of status messages called
tweets. Since most tweets are publicly accessible, Twitter
has become a huge source of information about ongoing
events, topics and people's activities. Hashtags are written
as combinations of keywords, abbreviations or argot in order
to track the tweets about the corresponding events, topics
or activities. Therefore, meaningful groups of hashtags can
represent ongoing events and topics, and reflect people's
interests and opinions. Tracing the changes of these groups
can further give researchers better ideas for mining and
understanding people's interests and opinions. To better
understand and track the changes, it is very helpful to
identify the unchanged part of a cluster.

The major work in this project consists of the following:

• Based on the correlated hashtag graphs, active hashtag
graphs were constructed on a daily basis to represent the
currently active correlations of hashtags.

• The concepts of the stable core and the temporary members
of a hashtag cluster are defined in order to better trace
the changes of the correlated hashtag groups. The key task
is to identify the stable cores, which have persistent
relationships among their hashtags, while the temporary
members are assumed to reflect transient interests and
opinions which may change later.

• As an initial attempt at stable core detection, three
algorithms are designed, implemented and tested: a) the
Top-N algorithm identifies the top N most closely related
hashtags in a cluster; b) the Above-Average-Support method
finds the relationships whose support score is above the
average support of the cluster; c) the Threshold-based
method treats all relationships above a threshold as part of
the stable core.

2. BACKGROUND 

Eleven days of tweets were collected via the Twitter
streaming API, which randomly samples 1% of incoming Twitter
statuses. The hashtags are extracted from the tweets.
Hashtags which appear in the same tweet are assumed to be
contextually correlated.
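
The co-occurrence assumption above can be illustrated with a
minimal Python sketch. The input shape (pairs of a user id
and a list of hashtags) and the function name
`cooccurrence_users` are assumptions made for illustration;
they are not the Twitter streaming API format or the code
used in this project. Users are collected in sets so that
the number of distinct users per hashtag and per hashtag
pair can be read off directly.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_users(tweets):
    """Group users by hashtag and by co-occurring hashtag pair.

    `tweets` is an iterable of (user_id, hashtags) pairs; this input
    shape is an assumption, not the Twitter streaming API format.
    """
    tag_users = defaultdict(set)   # hashtag -> users who used it
    pair_users = defaultdict(set)  # {tag_a, tag_b} -> users who used both in one tweet
    for user, hashtags in tweets:
        unique = sorted(set(hashtags))
        for tag in unique:
            tag_users[tag].add(user)
        for a, b in combinations(unique, 2):
            pair_users[frozenset((a, b))].add(user)
    return tag_users, pair_users


# Toy data, invented for illustration only.
tweets = [
    ("u1", ["work", "school"]),
    ("u2", ["work", "school", "monday"]),
    ("u3", ["work"]),
    ("u1", ["work", "school"]),  # repeated use by the same user counts once
]
tag_users, pair_users = cooccurrence_users(tweets)
```

Because users are stored in sets, a user who repeats the
same hashtag pair is counted only once.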

The dynamic correlated hashtag graph is defined as a series
of static undirected graphs $G^t = (V^t, E^t)$. $V^t$ is the
set of vertices in the graph, i.e. the set of hashtags found
at time $t$. The edges of the static graph at time $t$ are
defined as $e^t = (v_1^t, v_2^t)$ where $e^t \in E^t$ and
$v_1^t, v_2^t \in V^t$. Two hashtags have an edge between
them when they appear in the same tweet. The vertices and
edges in the graph are weighted by support scores, i.e. the
number of users who use these hashtags: $U(v^t)$ is the
number of users who use the hashtag $v^t$, and $U(e^t)$ is
the number of users who use the two hashtags of $e^t$
together at time $t$. Another metric used to evaluate the
correlation of the hashtags and to weight the edges is the
Jaccard coefficient
$J(v_1, v_2) = |v_1 \cap v_2| / |v_1 \cup v_2|$. Similar
definitions of support score have also been used for
keyword-correlated graphs.
The cluster in this project is defined as a maximal clique
[2]. The clusters found by a maximal clique detection
algorithm can represent meaningful hashtag groups, as in the
examples in Fig. 1 and Fig. 2. In related work, changes of
clusters are usually traced by community similarity/distance
or by set relationships (superset, subset) [3]. The stable
core studied in this project is proposed as a possible
identity of the hashtag clusters. Fig. 1 and Fig. 2 show
examples of clusters based on 'work' and 'school' on
different dates. Some members of the clusters are subject to
change because they have only temporary relationships with
each other. For example, hashtags like 'Monday' and
'Wednesday' become active or inactive depending on the
weekday on which the hashtags are extracted.



Figure 1: Clusters on Oct. 2nd 



Figure 2: Clusters on Oct. 7th

In contrast, some hashtags like 'work' and 'school' have
persistent correlations. Therefore, a cluster's identity may
be marked by the persistent part of the cluster plus some
special hashtags from the temporary part. This project
explores how to find the persistent part of the cluster,
i.e. the stable core.
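
Since clusters are defined as maximal cliques, a small
Python sketch of clique enumeration may help make the
definition concrete. It uses the basic Bron-Kerbosch
recursion, which is an assumption chosen for brevity; the
text does not state which maximal clique detection algorithm
was used in this project. The adjacency data is invented for
illustration.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    `adjacency` maps each hashtag to the set of its neighbours.
    """
    cliques = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when no vertex can extend it.
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for tag in list(candidates):
            expand(clique | {tag},
                   candidates & adjacency[tag],
                   excluded & adjacency[tag])
            # Move the processed vertex from candidates to excluded.
            candidates = candidates - {tag}
            excluded = excluded | {tag}

    expand(set(), set(adjacency), set())
    return cliques


# Toy graph, invented for illustration only.
adjacency = {
    "work": {"school", "monday", "job"},
    "school": {"work", "monday"},
    "monday": {"work", "school"},
    "job": {"work"},
}
clusters = maximal_cliques(adjacency)
```

On this toy graph the clusters are the triangle of 'work',
'school' and 'monday', and the single edge between 'work'
and 'job'.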

2.1 Active Correlated Hashtag Graph 

The active correlated hashtag graph (AHG) is a preprocessed
version of the correlated hashtag graph. The hashtags and
edges that exist in the active graph are determined by
thresholds on the hashtag support score $Thr_v$, the edge
support score $Thr_e$ and the Jaccard coefficient of the
edges $Thr_J$. The AHG is thus defined as
$G_A^t \subseteq G^t$ where, for all $v^t$,
$U(v^t) > Thr_v$, and for all $e^t$,
$U(e^t) > Thr_e \wedge J(v_1, v_2) > Thr_J$, where
$e^t = (v_1^t, v_2^t)$.
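
The thresholding step can be sketched in Python as follows.
This is a minimal sketch under two assumptions that are not
stated in the text: the Jaccard coefficient is computed over
the sets of users of the two hashtags, and an edge is kept
only if both of its endpoints are active. The function and
parameter names are hypothetical.

```python
def jaccard(users_a, users_b):
    """Jaccard coefficient of two user sets: |A & B| / |A | B|."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def active_graph(tag_users, pair_users, thr_v, thr_e, thr_j):
    """Keep hashtags and edges whose scores exceed all thresholds."""
    active_tags = {t for t, users in tag_users.items() if len(users) > thr_v}
    active_edges = {}
    for pair, users in pair_users.items():
        a, b = tuple(pair)
        if a not in active_tags or b not in active_tags:
            continue  # assumption: an edge needs both endpoints to be active
        if len(users) > thr_e and jaccard(tag_users[a], tag_users[b]) > thr_j:
            active_edges[pair] = len(users)  # edge weight is the support U(e)
    return active_tags, active_edges


# Toy data, invented for illustration only.
tag_users = {
    "work": {"u1", "u2", "u3", "u4"},
    "school": {"u1", "u2", "u3"},
    "monday": {"u4"},
    "job": {"u1", "u4", "u5"},
}
pair_users = {
    frozenset(("work", "school")): {"u1", "u2", "u3"},
    frozenset(("work", "monday")): {"u4"},
    frozenset(("work", "job")): {"u1", "u4"},
}
tags, edges = active_graph(tag_users, pair_users, thr_v=1, thr_e=1, thr_j=0.5)
```

Here 'monday' is dropped for low hashtag support, and the
edge between 'work' and 'job' is dropped because its Jaccard
coefficient (2/5) does not exceed the threshold.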

2.2 Stable Core 

The stable cores are the persistent sub-cliques of a
cluster. The clusters (i.e. maximal cliques) in the AHG
$G_A^t$ are denoted $C(G_A^t)$. The stable cores
$SC(C(G_A^t), N)$ are the sub-cliques of $C(G_A^t)$ that
also exist among the sub-cliques of $C(G_A^{t+N})$. For
instance, suppose there is a cluster
["work", "school", "Wednesday"] in $C(G_A^0)$ and a cluster
["work", "school", "mon…"] in $C(G_A^5)$. Then one of the
stable cores in $SC(C(G_A^0), 5)$ is ["work", "school"].
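
The definition can be checked mechanically. The following
Python sketch is an illustration, not the project's code:
it returns only the largest shared sub-clique for each pair
of clusters (every subset of it also qualifies under the
definition), and the minimum core size of 2 and the example
hashtags are assumptions.

```python
def stable_cores(clusters_t, clusters_later, min_size=2):
    """Largest sub-cliques shared by a cluster at time t and a later cluster."""
    cores = set()
    for cluster in clusters_t:
        for later in clusters_later:
            shared = frozenset(cluster) & frozenset(later)
            if len(shared) >= min_size:  # assumption: a core has >= 2 hashtags
                cores.add(shared)
    return cores


def is_stable_core(core, clusters_later):
    """True if `core` survives as a sub-clique of some later cluster."""
    return any(set(core) <= set(later) for later in clusters_later)


# Illustrative clusters on day 0 and day 5.
day0 = [{"work", "school", "wednesday"}]
day5 = [{"work", "school", "monday"}]
cores = stable_cores(day0, day5)
```

Any subset of two clusters' intersection is a clique in both
snapshots, which is why intersecting the clusters is enough
to find the shared sub-cliques.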

3. METHOD 

Intuitively, the more people support the joint usage of two
hashtags, the longer the relationship is likely to persist.
Three approaches to finding stable cores, all based on the
support scores of edges in a snapshot of the AHG, are
explored. The input to each approach is a cluster and the
output is one stable core of the input cluster.

TOP-N_STABLE_CORE:

  Input: Cluster C, Core size N
  Output: Core of C
  Core ← ∅
  if len(C) >= N then
    Core ← {Hashtags in Top_Scored_Edge(C)}
    while len(Core) < N do
      Core ← Core ∪ {tag} where tag ∈ {C \ Core} ∧ tag ∈
        Top_Scored_Edge(Neighbors(Core) ∩ {C \ Core})
    end while
  end if

Figure 3: Top-N stable core detection

ABOVE_AVERAGE_CORE:

  Input: Cluster C
  Output: Core of C
  Core ← {Hashtags in Top_Scored_Edge(C)}
  for all tag ∈ C do
    if tag ∉ Core ∧ (∀c ∈ Core : Sup(tag, c) > AvgSup(C)) then
      Core ← Core ∪ {tag}
    end if
  end for

Figure 4: AA stable core detection

3.1 Top-N 

The Top-N method first takes the top-scored edge in the
input cluster as the initial stable core. It then repeatedly
picks the top-scored remaining edge in the cluster that
connects to a hashtag already in the stable core, until N
hashtags have been found for the stable core. The method is
described in Fig. 3. The size of the stable cores is fixed
in this method, but the absolute support scores of the edges
are not restricted.
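
A runnable Python reading of the Top-N procedure in Fig. 3
is sketched below. The representation of support scores as a
dictionary keyed by hashtag pairs, the tie-breaking by
hashtag name, and the guard for clusters with fewer than two
hashtags are assumptions made for this sketch; the support
values are invented.

```python
from itertools import combinations


def top_n_core(cluster, sup, n):
    """Grow a core of n hashtags from the top-scored edge of `cluster`.

    `sup` maps frozenset({a, b}) to the edge support score.
    """
    cluster = set(cluster)
    if len(cluster) < n or len(cluster) < 2:
        return set()
    best = max(combinations(sorted(cluster), 2),
               key=lambda e: sup[frozenset(e)])
    core = set(best)
    while len(core) < n:
        # Highest-scored edge linking a core hashtag to one outside the core;
        # ties are broken by hashtag name (an arbitrary choice).
        _, tag = max((sup[frozenset((c, t))], t)
                     for c in sorted(core)
                     for t in sorted(cluster - core))
        core.add(tag)
    return core


# Toy cluster and support scores, invented for illustration only.
cluster = {"work", "school", "job", "monday"}
sup = {
    frozenset(("work", "school")): 10,
    frozenset(("work", "job")): 6,
    frozenset(("school", "job")): 5,
    frozenset(("work", "monday")): 3,
    frozenset(("school", "monday")): 2,
    frozenset(("job", "monday")): 1,
}
core3 = top_n_core(cluster, sup, 3)
```

With N = 3 the core starts from the edge between 'work' and
'school' and then adds 'job' through its edge to 'work'.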

3.2 Above Average Support 

The Above Average Support (AA) method also starts from the
top-scored edge in the input cluster. Then the hashtags
whose edges to all of the hashtags in the core have support
scores above the average of the cluster are added to the
stable core. The method is described in Fig. 4. The average
support AvgSup(c) is the average edge support of clique c.
The size of the stable cores is not fixed, but the absolute
support scores of the edges in the stable cores are still
not restricted in this method.
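
The AA procedure of Fig. 4 can be sketched in Python as
follows. As in the pseudocode, the core grows while the
cluster is scanned, so the result can depend on the scan
order; scanning in sorted order is an assumption made here
for determinism. The data structures and support values are
likewise illustrative, and the cluster is assumed to contain
at least two hashtags.

```python
from itertools import combinations


def avg_sup(tags, sup):
    """Average support over all edges of the clique on `tags`."""
    edges = list(combinations(sorted(tags), 2))
    return sum(sup[frozenset(e)] for e in edges) / len(edges)


def above_average_core(cluster, sup):
    """Core seeded by the top edge, extended by above-average hashtags."""
    cluster = sorted(cluster)
    threshold = avg_sup(cluster, sup)
    core = set(max(combinations(cluster, 2),
                   key=lambda e: sup[frozenset(e)]))
    for tag in cluster:
        # A hashtag joins only if all its edges to the current core
        # are above the cluster's average support.
        if tag not in core and all(
                sup[frozenset((tag, c))] > threshold for c in core):
            core.add(tag)
    return core


# Toy cluster and support scores, invented for illustration only.
cluster = {"work", "school", "job", "monday"}
sup = {
    frozenset(("work", "school")): 10,
    frozenset(("work", "job")): 6,
    frozenset(("school", "job")): 5,
    frozenset(("work", "monday")): 3,
    frozenset(("school", "monday")): 2,
    frozenset(("job", "monday")): 1,
}
core = above_average_core(cluster, sup)
```

The average support of this toy cluster is 27/6 = 4.5, so
'job' (edges of 5 and 6 to the seed) joins the core while
'monday' does not.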

3.3 Edge Threshold 

In this method, described in Fig. 5, only the edge support
threshold is considered. This approach also starts from the
top-supported edge, and the process is very similar to the
AA method. The only difference is that, instead of the
average support of the cluster, a fixed edge support
threshold is used to decide which hashtags will be added to
the core.
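
A Python sketch of the edge threshold procedure of Fig. 5
follows. It assumes the same illustrative data structures as
a plain dictionary of pair supports, a cluster of at least
two hashtags, and a sorted scan order; none of these choices
come from the text. For the two-hashtag seed, the AvgSup
check of Fig. 5 reduces to the score of the seed edge.

```python
from itertools import combinations


def edge_threshold_core(cluster, sup, thr):
    """Core seeded by the top edge, extended by edges above `thr`."""
    cluster = sorted(cluster)
    core = set(max(combinations(cluster, 2),
                   key=lambda e: sup[frozenset(e)]))
    # The seed has a single edge, so its average support is that edge's score.
    if sup[frozenset(core)] <= thr:
        return set()
    for tag in cluster:
        if tag not in core and all(
                sup[frozenset((tag, c))] > thr for c in core):
            core.add(tag)
    return core


# Toy cluster and support scores, invented for illustration only.
cluster = {"work", "school", "job", "monday"}
sup = {
    frozenset(("work", "school")): 10,
    frozenset(("work", "job")): 6,
    frozenset(("school", "job")): 5,
    frozenset(("work", "monday")): 3,
    frozenset(("school", "monday")): 2,
    frozenset(("job", "monday")): 1,
}
core_low = edge_threshold_core(cluster, sup, 4)
core_mid = edge_threshold_core(cluster, sup, 5)
core_high = edge_threshold_core(cluster, sup, 10)
```

Raising the threshold from 4 to 5 removes 'job' from the
core, and a threshold of 10 rejects even the seed edge, so
no core is returned.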

4. EXPERIMENTS 

EDGE_THRESHOLD_CORE:

  Input: Cluster C, Threshold Thr
  Output: Core of C
  Core ← {Hashtags in Top_Scored_Edge(C)}
  if AvgSup(Core) > Thr then
    for all tag ∈ C do
      if tag ∉ Core ∧ (∀c ∈ Core : Sup(tag, c) > Thr) then
        Core ← Core ∪ {tag}
      end if
    end for
  else
    Core ← ∅
  end if

Figure 5: Edge Threshold

The data was crawled from Twitter between 10/2/2013 and
10/12/2013. Snapshots of the dynamic graph are taken each
day. For each of the stable core detection methods, an
experiment based on the first day of our data was done to
find how well it can identify the cores that remain stable
in the following days. The performance of the methods is
evaluated by the real stable core ratio of $SC(G_A^0, N)$,
which is the fraction of the cores found on $G_A^0$ that
survive on the $N$th day, i.e.
$|D^0 \cap SC(G_A^0, N)| / |D^0|$, where $D^0$ denotes the
set of cores detected on $G_A^0$. The three methods were
then compared based on the overall real stable core ratio
and the number of cores found.
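
The real stable core ratio can be computed as in the
following Python sketch. A detected core is counted as
surviving if it is a subset of some cluster of the day-N
snapshot, following the stable core definition of Section
2.2. Returning 0.0 when no cores were detected is an
assumption, and the example data is invented.

```python
def real_core_ratio(detected_cores, clusters_day_n):
    """Fraction of detected cores that are sub-cliques of a day-N cluster."""
    if not detected_cores:
        return 0.0  # assumption: define the ratio as 0 when nothing was detected
    surviving = [
        core for core in detected_cores
        if any(set(core) <= set(cluster) for cluster in clusters_day_n)
    ]
    return len(surviving) / len(detected_cores)


# Toy cores detected on day 0 and clusters observed on day N.
detected = [{"work", "school"}, {"music", "video"}, {"love", "monday"}]
clusters_day_n = [{"work", "school", "monday"}, {"music", "video", "live"}]
ratio = real_core_ratio(detected, clusters_day_n)
```

Two of the three detected cores are still contained in a
day-N cluster, so the ratio is 2/3.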

4.1 Performance of Top-N 

N in Top-N is set to 3 in this experiment. The stable core
ratios of $SC(G_A^0, n)$ for $n \in [1, 10]$ are calculated.
Each line in Fig. 6 shows the change in the real stable core
ratio for cores whose average support score falls in a
specific range. The x-axis is the day and the y-axis is the
stable core ratio. The higher the average support score of
the cores, the higher the fraction that is still active from
the 1st to the 10th day after the first day. As the days
progress, the fraction of surviving cores goes down when the
average support score is relatively high. For relatively low
average support scores, however, the ratio does not change
much: it is uniformly low. Therefore, the cores found by
this method with a high average support score are more
likely to be real stable cores.

4.2 Performance of AA 

From Fig. 7 we can see that the performance pattern of the
above average support method is very similar to that of the
Top-N method, except that the highest stable core ratio, in
the top range of average support scores, is higher.

4.3 Performance of Edge Threshold 

Figure 7: Real Core Ratios of AA

The performance of the edge support threshold method is
shown in Fig. 8. From the observations of the previous
experiments, we found that when the average support score of
the cores is above 6, the real stable core ratio is above
0.6. So in this experiment the edge support threshold is set
to 6. The numbers fluctuate more than those of the previous
two methods. This is probably because some edges in a core
may have a relatively low support score even if the average
support score is high.

4.4 Comparison 

The performance is evaluated by the number of real cores
found and by the real core ratio. Higher scores in these two
metrics mean better performance. These two numbers relate to
the recall and precision of a stable core detection method.
The recall (the ratio of all stable cores that are found by
our algorithms) cannot easily be calculated because we
cannot enumerate all the stable cores in a data set as large
as Twitter. But recall is directly proportional to the
number of stable cores found by these algorithms, so we use
the number of stable cores to compare the different
algorithms. In the comparison experiment, the methods are
evaluated on detecting $SC(G_A^0, 7)$. Fig. 9 shows the
cores found by the Top-N method with different
configurations, the AA method, and the edge threshold method
with the threshold set to 6. The Top-N method finds the most
cores among the evaluated methods, but Fig. 10 shows that
the real core ratios of all Top-N configurations tested are
low. The above average support method and the edge support
threshold method find more cores than the Top-N methods,
except Top-3. AA and edge threshold also have much higher
real core ratios than the Top-N methods. Thus, based only on
these two metrics, AA and edge threshold perform better than
the Top-N methods. Also, according to these figures, the
edge support threshold method with threshold = 6 scores
slightly higher on these two metrics than the AA method.
Fig. 11 shows that although Top-N with high N values has a
high average core score, these large cores still tend to
split or disappear in the future. This is probably because
some edges with relatively low support are also included in
the stable cores, which makes them unstable.

Figure 9: Cores Found

Figure 10: Real Core Ratios

Figure 11: Average Core Score

5. CONCLUSION AND FUTURE WORK 

From the above discussion, the Above Average Support method
and the Edge Threshold method are more promising than the
Top-N method. AA can find stable cores with very high
precision, in terms of real core ratio, when the average
core score is high. The Edge Threshold method has slightly
higher performance in general but may more easily include
some edges whose support scores are lower than the average
support of the cluster, which may be a reason why it
performs slightly worse than AA when the average score of
the core is high. Therefore, combining these two methods may
lead to a better real core ratio, but more study is needed
to find as many stable cores as possible. In order to
improve both the real core ratio and the number of cores
detected, more factors should be taken into consideration,
such as the number of clusters a core belongs to, the
historical appearances of a core, and the changes in support
of an edge. More powerful prediction methods also need to be
tried. In addition, metrics for the quality of the cores
found need to be better defined: how informative a core is
and what the proper size of a core is need to be studied.
Another issue is that multiple stable cores may exist in a
cluster, whereas the current approach assumes one stable
core per cluster. Additionally, the selection of temporary
members for cluster identity is another piece of work needed
to complete the cluster identity and tracing problem. Beyond
tracking changes by serving as the identity of hashtag
clusters, distinguishing the stable part and the temporary
members of the clusters can be part of applications like
opinion mining and hashtag recommendation systems, because
both the stable members and the temporary members should be
considered in order to reflect different aspects and
temporarily popular topics/events of related hashtags.